13. Pong with PPO (code walkthrough)
Additional Notes
- Try normalizing your future rewards across all the parallel agents; it can speed up training (see the normalization sketch after this list).
- Simpler networks might perform better than more complicated ones! The original input contains 80x80x2 = 12,800 numbers, so you may want to ensure that this count steadily decreases at each layer of the neural net (a network sketch follows the list).
- Training performance may be significantly worse on local machines. I saw worse results training on my own Windows desktop with a 4-core CPU and a GPU, possibly due to slight differences in how the emulator renders frames. So please run the code in the workspace first before moving it locally.
- It may be beneficial to train in multiple stages: say, first with a small tmax=200 for 500 episodes, then again with tmax=400 for 500 episodes, and finally with an even larger tmax (see the staged-training sketch below).
- Remember to save your policy after training!
- For a challenge, try the 'Pong-v4' environment; it includes random frameskips and takes longer to train (a one-line setup snippet is below).
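
As a concrete version of the normalization tip, here is a minimal sketch. It assumes the discounted future rewards have been collected into a NumPy array of shape `(tmax, n_agents)`; the name `rewards_future` and that layout are assumptions, not fixed by the walkthrough:

```python
import numpy as np

def normalize_rewards(rewards_future):
    """Standardize future rewards across the parallel agents.

    rewards_future: array of shape (tmax, n_agents) holding the
    discounted future reward for each timestep and each agent.
    """
    mean = np.mean(rewards_future, axis=1)
    std = np.std(rewards_future, axis=1) + 1.0e-10  # avoid division by zero
    # Broadcast the per-timestep statistics back over the agent axis.
    return (rewards_future - mean[:, np.newaxis]) / std[:, np.newaxis]
```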
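One way to apply the "steadily decreasing" advice is a network whose representation shrinks at every layer. The sketch below is a hypothetical architecture, not necessarily the walkthrough's exact network: it takes the 2x80x80 stacked-frame input (12,800 values) down through 5,776 → 1,296 → 256 → 1, ending in a single action probability (e.g., P(RIGHT) in a two-action setup):

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Small conv policy: two stacked 80x80 frames -> one action probability."""
    def __init__(self):
        super().__init__()
        # 2x80x80 (12,800 values) -> 4x38x38 (5,776 values)
        self.conv1 = nn.Conv2d(2, 4, kernel_size=6, stride=2)
        # 4x38x38 -> 16x9x9 (1,296 values)
        self.conv2 = nn.Conv2d(4, 16, kernel_size=6, stride=4)
        self.fc1 = nn.Linear(16 * 9 * 9, 256)  # 1,296 -> 256
        self.fc2 = nn.Linear(256, 1)           # 256 -> 1
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        x = x.view(x.size(0), -1)  # flatten for the fully connected layers
        x = self.relu(self.fc1(x))
        return self.sigmoid(self.fc2(x))
```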
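The staged schedule and the final save might look like the sketch below. Here `train_ppo` is a placeholder for whatever training function your notebook defines, `Policy` is the class sketched above, and the filename is just an example:

```python
import torch

def train_ppo(policy, tmax, episodes):
    """Placeholder: collect trajectories of length tmax from the parallel
    envs and run the PPO updates for the given number of episodes."""
    ...

policy = Policy()

# Train in stages with a growing trajectory length tmax.
for tmax, episodes in [(200, 500), (400, 500), (800, 500)]:
    train_ppo(policy, tmax=tmax, episodes=episodes)

# Remember to save the trained weights so the run isn't lost!
torch.save(policy.state_dict(), 'PPO_pong.policy')
```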
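For the challenge environment, only the environment id changes (assuming the classic gym Atari environments are installed; the walkthrough's parallel-environment wrapper would take the same string):

```python
import gym

# 'PongDeterministic-v4' uses a fixed frameskip; 'Pong-v4' samples the
# frameskip randomly, which makes learning slower and harder.
env = gym.make('Pong-v4')
```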